Abstract

This paper presents methods to visualize feature spaces
commonly used in object detection. The tools in this paper
allow a human to put on “feature space glasses” and see
the visual world as a computer might see it. We found that
these “glasses” allow us to gain insight into the behavior
of computer vision systems. We show a variety of experiments
with our visualizations, such as examining the linear
separability of recognition in HOG space, generating high
scoring “super objects” for an object detector, and diagnosing
false positives. We pose the visualization problem as
one of feature inversion, i.e. recovering the natural image
that generated a feature descriptor. We describe four algorithms
to tackle this task, with different trade-offs in speed,
accuracy, and scalability. Our most successful algorithm
uses ideas from sparse coding to learn a pair of dictionaries
that enable regression between HOG features and natural
images, and can invert features at interactive rates. We
believe these visualizations are useful tools to add to an object
detector researcher’s toolbox, and code is available.


1.  Introduction 

A  core  building  block  for  most  modern  recognition  sys¬ 
tems  is  a  histogram  of  oriented  gradients  (HOG)  [5].  While 
machines  struggle  to  comprehend  raw  pixel  values,  HOG 
provides  computers  with  a  higher  level  representation  of  an 
image.  The  computational  power  of  this  representation  has 
been  substantially  demonstrated  by  the  community  in  object 
detection  [3,  10,  19,  25,  32]  as  well  as  scene  classification 
[22,  30]  and  motion  tracking  [2,  11]. 

Yet,  the  human  vision  system  processes  photons — not 
high  dimensional  vectors — making  human  interpretation  of 
HOG  features  potentially  counter-intuitive.  As  object  detec¬ 
tion  researchers,  we  often  spend  considerable  time  staring  at 
false  positives  and  asking  ourselves:  why  does  our  detector 
think  there  is  a  microwave  flying  in  the  sky? 


*This paper is a pre-print of our conference paper. Last modified
December 23, 2012.


Figure 1: In this paper, we present several algorithms for
inverting HOG descriptors back to images. Columns: (a) HOG,
(b) Inverse (This Paper), (c) Original. The middle column
is generated only from HOG features.


In  this  paper,  we  attempt  to  give  humans  a  microscope 
into  the  world  of  HOG.  We  present  four  algorithms  for  vi¬ 
sualizing  and  inverting  HOG  features  back  into  natural  im¬ 
ages.  Each  algorithm  has  different  trade-offs,  varying  in 
speed,  accuracy,  and  scalability.  Some  of  our  algorithms 
use  large  databases;  some  are  parametric.  All  of  our  algo¬ 
rithms  are  simple  to  use  and  understand.1 

Our visualizations, shown in Fig. 1, are intuitive for humans
to grasp while still remaining true to the information
stored  inside  each  HOG  feature,  a  claim  we  support  with  a 
user  study.  We  found  that  this  visualization  power  can  give 
us  insight  into  the  behavior  of  object  detectors.  For  exam¬ 
ple,  when  we  invert  the  false  positives  for  an  object  detector, 

1 Code is available at http://mit.edu/vondrick/ihog.
Figure 2: Inverting HOG features using exemplar LDA. We train an exemplar LDA model on the HOG descriptor we wish
to invert and apply it to a large database. The left hand side of the equation above shows the top detections, while the right hand
side shows the average of the top 100. Even though all top detections are semantically meaningless, their average is close to
the original image, shown on the right. Notice that all the top detections share structure with the original, e.g., the top left
bottles create the smoke stack for the ship, and the middle right hands compose the wings for the bird.


we  find  the  inversions  look  like  true  positives.  This  result 
suggests  that  the  false  positives  are  reasonable,  and  higher 
level  reasoning  may  be  necessary  to  solve  object  detection. 
By  observing  the  visual  world  as  object  detectors  see  it,  we 
can  more  clearly  understand  object  recognition  systems. 

The  contributions  in  this  paper  are  two-fold.  First,  we  of¬ 
fer  four  algorithms  to  invert  HOG  features.  Second,  we  use 
our  inversion  algorithms  to  examine  the  behavior  of  object 
detectors.  To  this  end,  in  section  2,  we  briefly  review  related 
work  in  reconstructing  images  given  their  feature  descrip¬ 
tors.  In  section  3,  we  describe  four  algorithms  for  inverting 
and  visualizing  HOG  features.  Although  we  focus  on  HOG 
in  this  paper,  our  approach  is  general  and  can  be  applied  to 
other  features  as  well.  In  section  4,  we  evaluate  the  perfor¬ 
mance  of  our  visualizations  with  a  human  study  by  asking 
subjects  to  identify  objects  given  only  their  inverse.  Finally, 
in  section  5,  we  present  a  variety  of  experiments  using  HOG 
inversion  to  visualize  the  behavior  of  object  detectors. 

2.  Related  Work 

There  has  been  relatively  little  work  in  feature  inversion 
so  far.  Torralba  and  Oliva,  in  early  work  [27],  described 
a  simple  iterative  procedure  to  recover  images  only  given 
gist  descriptors  [21].  Weinzaepfel  et  al.  [29]  were  the  first 
to  reconstruct  an  image  given  its  keypoint  SIFT  descrip¬ 
tors  [17].  Their  approach  obtains  compelling  reconstruc¬ 
tions  using  a  nearest  neighbor  based  approach  on  a  mas¬ 
sive  database.  We  encourage  readers  to  see  their  full  color 
reconstructions.  However,  their  approach  only  focuses  on 
sparse keypoint SIFT descriptors. Since most object detectors
use a dense histogram of visual features, we instead
present algorithms for inverting histogram features for detection,
the most popular of which is HOG. Our algorithms
are also quick, allowing for nearly real time visualization.
d'Angelo et al. [6] further developed an algorithm to reconstruct
images given only LBP features [4, ]. Their method
analytically solves for the inverse image and does not require
a dataset. While [29, 6, 27] do a good job at reconstructing
images from SIFT, LBP, and gist features, to our
knowledge, this paper is the first to invert HOG.

This  work  also  complements  a  recent  line  of  papers  that 
examine  object  detectors.  Hoiem  et  al.  [L  ]  performed  a 
large  study  analyzing  the  errors  that  object  detectors  make. 
Parikh  and  Zitnick  [23]  introduced  a  paradigm  for  human 
debugging  of  object  detectors.  Divvala  et  al.  [7]  analyze 
part-based  detectors  to  determine  the  importance  of  each 
piece  in  the  object  detection  stack.  Tatu  et  al.  [24]  explored 
the  set  of  images  that  generate  identical  HOG  descriptors. 
Zhu  et  al.  [3  ]  try  to  determine  whether  we  have  reached 
Bayes  risk  for  HOG.  In  this  paper,  we  analyze  object  detec¬ 
tors  by  direct  inspection:  we  visualize  the  world  as  comput¬ 
ers  see  it  by  inverting  HOG. 

3.  Feature  Inversion  Algorithms 

Let $x \in \mathbb{R}^D$ be an image and $y = \phi(x)$ be the corresponding
HOG feature descriptor. Since $\phi(\cdot)$ is a many-to-one
function, no analytic inverse exists. Hence, we seek an
image $x$ that, when we compute HOG on it, closely matches
the original descriptor $y$:

$$\phi^{-1}(y) = \operatorname*{argmin}_{x \in \mathbb{R}^D} \|\phi(x) - y\|_2^2 \quad (1)$$

Optimizing Eqn. 1 is challenging. Although Eqn. 1 is non-convex,
we tried gradient-descent strategies by numerically
evaluating the derivative in image space with Newton’s
method. Unfortunately, we observed poor results, likely because
HOG is highly sensitive to noise and Eqn. 1 has
frequent local minima. In the remainder of this section, we
present four different algorithms for inverting HOG.

3.1.  Algorithm  A:  Exemplar  LDA  (ELDA) 

Consider  the  top  detections  for  the  exemplar  object  de¬ 
tector  [12,  19]  for  a  few  images  shown  in  Fig. 2.  Although 
all  top  detections  are  false  positives,  notice  that  each  detec¬ 
tion  captures  some  statistics  about  the  query.  Even  though 
the  detections  are  wrong,  if  we  squint,  we  can  see  parts  of 
the  original  object  appear  in  each  detection. 
We use this simple observation to produce our first inversion
algorithm. Suppose we wish to invert HOG feature $y$.
We first train an exemplar LDA detector [12] for this query,
$w = \Sigma^{-1}(y - \mu)$. We then score $w$ against every sliding
window on a large database. The HOG inverse is then
simply the average of the top $K$ detections in RGB space:
$\phi_A^{-1}(y) = \frac{1}{K} \sum_{i=1}^{K} z_i$ where $z_i$ is a top detection.

This  method,  although  simple,  produces  surprisingly  ac¬ 
curate  reconstructions,  even  when  the  database  does  not 
contain  the  category  of  the  HOG  template.  We  note  that 
this  method  may  be  subject  to  dataset  bias  [26]  but  could 
be  overcome  [15].  We  also  point  out  that  a  similar  nearest 
neighbor  method  is  used  in  brain  research  to  visualize  what 
a  person  might  be  seeing  [20]. 
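The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the HOG features and pixels of every database window have already been extracted into the hypothetical arrays `window_feats` and `window_pixels`, and the small `reg` term added before solving is our own assumption to keep the linear solve well conditioned.

```python
import numpy as np

def elda_invert(y, mu, sigma, window_feats, window_pixels, k=100, reg=1e-3):
    """Average the pixels of the top-k windows scored by an exemplar LDA template.

    y             : (d,) HOG descriptor to invert
    mu, sigma     : (d,), (d, d) background mean and covariance of HOG features
    window_feats  : (n, d) HOG features of database windows (assumed precomputed)
    window_pixels : (n, p) flattened pixels of the same windows
    """
    d = y.shape[0]
    # w = Sigma^{-1} (y - mu); reg * I is an assumption, not part of the text.
    w = np.linalg.solve(sigma + reg * np.eye(d), y - mu)
    scores = window_feats @ w                 # one detection score per window
    top = np.argsort(scores)[::-1][:k]        # indices of the k highest scores
    return window_pixels[top].mean(axis=0)    # average of top detections in pixel space
```

With `k=1` the function simply returns the best-matching window, which makes the effect of averaging easy to inspect by increasing `k`.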


3.2.  Algorithm  B:  Ridge  Regression 


Unfortunately,  running  an  object  detector  across  a  large 
database  is  computationally  expensive.  In  this  section,  we 
present  a  fast,  parametric  inversion  algorithm. 

Let $X \in \mathbb{R}^D$ be a random variable representing a gray
scale image and $Y \in \mathbb{R}^d$ be a random variable of its corresponding
HOG point. We define these random variables
to be normally distributed on a $D + d$-variate Gaussian
$P(X, Y) \sim \mathcal{N}(\mu, \Sigma)$ with parameters $\mu = [\mu_X \ \mu_Y]$ and
$\Sigma = \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{XY}^T & \Sigma_{YY} \end{bmatrix}$.
In order to invert a HOG feature $y$, we
calculate the most likely image from the Gaussian $P$ conditioned
on $Y = y$:

$$\phi_B^{-1}(y) = \operatorname*{argmax}_{x \in \mathbb{R}^D} P(X = x \mid Y = y) \quad (2)$$

It is well known that Gaussians have a closed form conditional
mode:

$$\phi_B^{-1}(y) = \Sigma_{XY} \Sigma_{YY}^{-1} (y - \mu_Y) + \mu_X \quad (3)$$

Under  this  inversion  algorithm,  any  HOG  point  can  be  in¬ 
verted  by  a  single  matrix  multiplication,  allowing  for  inver¬ 
sion  in  under  a  second. 

We estimate $\mu$ and $\Sigma$ on a large database. In practice, $\Sigma$
is not positive definite; we add a small uniform prior (i.e.,
$\hat{\Sigma} = \Sigma + \lambda I$) so $\Sigma$ can be inverted. Since we wish to invert
any HOG point, we assume that $P(X, Y)$ is stationary [12],
allowing us to efficiently learn the covariance across massive
datasets. We invert an arbitrary dimensional HOG point
by marginalizing out unused dimensions.
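The closed form in Eqn. 3 can be illustrated with a short NumPy sketch. It is a simplified stand-in, not the authors' code: it estimates the covariances directly from paired samples rather than using the stationarity assumption, and the function names and the default value of `lam` are chosen here for illustration. Adding $\lambda I$ to the full covariance only changes the $\Sigma_{YY}$ block that Eqn. 3 inverts, so the sketch regularizes that block alone.

```python
import numpy as np

def fit_gaussian_inverter(X, Y, lam=1e-2):
    """Estimate W = Sigma_XY Sigma_YY^{-1} and the means from paired samples.

    X : (n, D) gray scale images, one per row
    Y : (n, d) corresponding HOG features, one per row
    """
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    n = X.shape[0]
    sigma_xy = Xc.T @ Yc / n                                # (D, d)
    sigma_yy = Yc.T @ Yc / n + lam * np.eye(Y.shape[1])     # (d, d), regularized
    # Sigma_YY is symmetric, so solving against Sigma_XY^T and transposing
    # gives Sigma_XY Sigma_YY^{-1} without forming an explicit inverse.
    W = np.linalg.solve(sigma_yy, sigma_xy.T).T
    return W, mu_x, mu_y

def invert(y, W, mu_x, mu_y):
    """Eqn. 3: a single matrix multiplication per HOG point."""
    return W @ (y - mu_y) + mu_x
```

Because `W` is computed once, inverting each new descriptor costs one matrix-vector product, which is what makes this algorithm fast.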

3.3.  Algorithm  C:  Direct  Optimization 

We  found  that  ridge  regression  yields  blurred  inversions. 
Intuitively,  since  HOG  is  invariant  to  shifts  up  to  its  bin  size, 
there  are  many  images  that  map  to  the  same  HOG  point. 
Ridge  regression  is  reporting  the  statistically  most  likely 
image,  which  is  the  average  over  all  shifts.  This  causes 
ridge  regression  to  only  recover  the  low  frequencies  of  the 
original  image. 


Figure 3: Some pairs of dictionaries for U and V. The left
of every pair is the gray scale dictionary element and the
right is the positive components of the corresponding element
in the HOG dictionary. Notice that the gray patches are
correlated with the HOG patches.


We now provide an algorithm to recover the high frequencies.
Let $U \in \mathbb{R}^{D \times K}$ be a natural image basis (e.g.,
the first $K$ eigenvectors of $\Sigma_{XX} \in \mathbb{R}^{D \times D}$). Any image
$x \in \mathbb{R}^D$ can be encoded by coefficients $\rho \in \mathbb{R}^K$ in this basis:
$x = U\rho$. Since ridge regression only recovers the first
few principal components of $U$, there is a residual term of
high frequencies left to be recovered:

$$x = \sum_{i=1}^{K} U_i \rho_i = \text{Low} + \text{High} = \phi_B^{-1}(y) + \sum_{i=J}^{K} U_i \rho_i \quad (4)$$

where $\phi_B^{-1}(y)$ was able to only recover $J$ components. The
goal of our third approach is to explicitly recover the high
frequency components, i.e. the second term. We wish to
minimize:

$$\phi_C^{-1}(y) = \operatorname*{argmin}_{\rho \in \mathbb{R}^K} \|\phi(\lambda \phi_B^{-1}(y) + U\rho) - y\|_2^2 \quad (5)$$

for some hyperparameter $\lambda \in \mathbb{R}$. Empirically we found
success optimizing Eqn. 5 using coordinate descent on $\rho$
with random restarts. We use an over-complete basis corresponding
to sparse Gabor-like filters for $U$. We compute
the eigenvectors of $\Sigma_{XX}$ across different scales and translate
smaller eigenvectors to form $U$.
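To make the optimization loop concrete, the following sketch runs a simple coordinate descent on the coefficients of a basis, accepting a change to one coefficient only when it lowers the feature-space error. Several parts are assumptions for illustration: `toy_features` is a stand-in that is not real HOG, the step schedule is arbitrary, and the random restarts mentioned in the text are omitted for brevity.

```python
import numpy as np

def toy_features(x):
    """Stand-in feature function (NOT real HOG): magnitudes of neighbouring differences."""
    return np.abs(np.diff(x))

def direct_optimize(y, x_low, U, phi=toy_features, lam=1.0, step=0.1, sweeps=20):
    """Coordinate descent on rho for ||phi(lam * x_low + U rho) - y||^2.

    y     : target feature vector
    x_low : low frequency estimate (e.g. a ridge regression inverse)
    U     : (D, K) basis whose coefficients rho are optimized
    """
    rho = np.zeros(U.shape[1])

    def loss(r):
        return float(np.sum((phi(lam * x_low + U @ r) - y) ** 2))

    best = loss(rho)
    for _ in range(sweeps):
        for i in range(rho.size):
            for delta in (step, -step):
                trial = rho.copy()
                trial[i] += delta          # perturb a single coordinate
                val = loss(trial)
                if val < best:             # keep the move only if it helps
                    rho, best = trial, val
        step *= 0.5                        # shrink the step after each sweep
    return lam * x_low + U @ rho, best
```

Every candidate move requires a fresh feature computation, which illustrates why this approach is slow when the real HOG function and a large basis are used.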

3.4.  Algorithm  D:  Paired  Dictionary  Learning 

Direct  optimization  obtains  highly  accurate  results,  but 
since  optimization  requires  computing  HOG  features  on  a 
large  number  of  candidate  images,  convergence  is  slow.  In 
our  final  algorithm,  we  propose  a  fast  approximation. 

Let $x \in \mathbb{R}^D$ be an image and $y \in \mathbb{R}^d$ be its HOG descriptor.
The key observation is that if we write $x$ and $y$ in
terms of bases $U \in \mathbb{R}^{D \times K}$ and $V \in \mathbb{R}^{d \times K}$ respectively,
but with shared coefficients $\alpha \in \mathbb{R}^K$,

$$x = U\alpha \quad \text{and} \quad y = V\alpha \quad (6)$$

then inversion can be obtained by first projecting the HOG
features $y$ onto the HOG basis $V$, then projecting $\alpha$ into the
natural image basis $U$:

$$\phi_D^{-1}(y) = U\alpha^* \quad (7)$$

$$\text{where } \alpha^* = \operatorname*{argmin}_{\alpha} \|y - V\alpha\|_2^2 \quad \text{s.t. } \|\alpha\|_1 \le \lambda \quad (8)$$

Since efficient solvers for Eqn. 8 exist [18, 16], we are able
to invert HOG patches in under a second.
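The two-step inversion of Eqns. 7 and 8 can be sketched as follows. Note the substitution: instead of the constrained problem in Eqn. 8, this sketch solves the closely related penalized (Lagrangian) lasso with plain ISTA, which is not the solver cited in the text. The function names, `penalty`, and `iters` are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(y, V, penalty=0.1, iters=500):
    """ISTA for min_a 0.5 * ||y - V a||_2^2 + penalty * ||a||_1."""
    L = np.linalg.norm(V, 2) ** 2            # Lipschitz constant of the gradient
    alpha = np.zeros(V.shape[1])
    for _ in range(iters):
        grad = V.T @ (V @ alpha - y)         # gradient of the quadratic term
        alpha = soft_threshold(alpha - grad / L, penalty / L)
    return alpha

def paired_invert(y, U, V, penalty=0.1):
    """Code y in the HOG dictionary V, then decode with the image dictionary U."""
    return U @ sparse_code(y, V, penalty)
```

The image dictionary `U` never enters the optimization; it is applied once to the sparse code, so the cost of inversion is dominated by the sparse coding step.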

This paired dictionary trick requires finding appropriate
bases $U$ and $V$ such that Eqn. 6 holds. To do this, we solve
a paired dictionary learning problem, inspired by recent
superresolution sparse coding work [31, 2< ]:

$$\operatorname*{argmin}_{U, V, \alpha} \sum_{i=1}^{N} \left( \|x_i - U\alpha_i\|_2^2 + \|\phi(x_i) - V\alpha_i\|_2^2 \right) \quad (9)$$
$$\text{s.t. } \|\alpha_i\|_1 \le \lambda \ \forall i, \quad \|U\|_2^2 \le \gamma_1, \quad \|V\|_2^2 \le \gamma_2$$

After a few algebraic manipulations, the above objective
simplifies to a standard sparse coding and dictionary learning
problem with concatenated dictionaries, which we optimize
using SPAMS [18]. Optimization typically took a few
hours on medium sized problems. We estimate $U$ and $V$
with a dictionary size $K \in O(10^3)$ and training samples
$N \in O(10^6)$ from a large database. See Fig. 3 for a visualization
of the learned dictionary pairs.
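The core of the algebraic step can be written out explicitly. The identity below uses a stacked vector $\tilde{x}_i$ and a stacked dictionary $B$; both names are introduced here for illustration and do not appear in the paper. It shows why the two reconstruction terms in Eqn. 9 collapse into a single one. How the separate norm constraints on $U$ and $V$ are treated after stacking is not spelled out in the text and is left aside here.

```latex
\tilde{x}_i = \begin{bmatrix} x_i \\ \phi(x_i) \end{bmatrix},
\qquad
B = \begin{bmatrix} U \\ V \end{bmatrix}

\|x_i - U\alpha_i\|_2^2 + \|\phi(x_i) - V\alpha_i\|_2^2
  = \left\| \begin{bmatrix} x_i - U\alpha_i \\ \phi(x_i) - V\alpha_i \end{bmatrix} \right\|_2^2
  = \|\tilde{x}_i - B\alpha_i\|_2^2
```

The squared Euclidean norm of a stacked vector is the sum of the squared norms of its blocks, so learning $B$ on the stacked training vectors and then splitting its rows recovers a pair $(U, V)$ that shares the codes $\alpha_i$.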

Unfortunately, the paired dictionary learning formulation
suffers on problems of nontrivial scale. In practice, we
only learn dictionaries for 5 x 5 HOG templates. In order
to invert a w x h HOG template y, we invert every 5 x 5
subpatch inside y and average overlapping patches in the final
reconstruction. We found that this approximation works
well in practice.


4.  Evaluation 

In  this  section,  we  evaluate  our  four  inversion  algorithms 
using  both  qualitative  and  quantitative  measures.  We  use 
PASCAL  VOC  2011  [8]  as  our  dataset  and  we  invert  patches 
corresponding  to  objects.  Any  algorithm  that  required  train¬ 
ing  could  only  access  the  training  set.  During  evaluation, 
only  images  from  the  validation  set  are  examined.  The 
database  for  exemplar  LDA  excluded  the  category  of  the 
patch  we  were  inverting  to  reduce  the  effect  of  biases. 

We  show  our  inversions  in  Fig.4  for  a  few  object  cate¬ 
gories.  Exemplar  LDA  and  ridge  regression  tend  to  pro¬ 
duce  blurred  visualizations.  Direct  optimization  recovers 
high  frequency  details  at  the  expense  of  extra  noise.  Paired 
dictionary  learning  produces  the  best  visualization  for  HOG 
descriptors.  By  learning  a  sparse  dictionary  over  the  visual 
world  and  the  correlation  between  HOG  and  natural  images, 
paired  dictionary  learning  recovered  high  frequencies  with¬ 
out  introducing  significant  noise. 


Figure 4: We show the results for all four of our inversion
algorithms on held out image patches with dimensions similar
to those common in object detection. Columns: (a) Original,
(b) ELDA, (c) Ridge, (d) Direct, (e) PairDict. In general,
exemplar LDA produces grainy inversions. Ridge regression
is blurry, but fast. Direct optimization is able to recover
high frequencies at the expense of extra noise; notice the
eyes on the sheep and cat, and details on the bus. Paired
dictionary learning often perceptually performs the best,
striking a middle ground between crisp and blurry.
Category    ELDA    Ridge   Direct  PairDict
aeroplane   0.634   0.633   0.596   0.609
bicycle     0.452   0.577   0.513   0.561
bus         0.627   0.632   0.587   0.585
cat         0.749   0.712   0.687   0.705
cow         0.720   0.663   0.632   0.650
horse       0.686   0.633   0.586   0.635
tvmonitor   0.711   0.640   0.638   0.629
Mean        0.671   0.656   0.620   0.637

Table 1: We evaluate the performance of our inversion algorithm
by comparing the inverse to the ground truth image
using the mean normalized cross correlation. Higher is better;
a score of 1 is perfect. In general, exemplar LDA does
slightly better at reconstructing the original pixels.

SIFT  Comparison:  We  compare  our  HOG  inversions 
against  SIFT  reconstructions  on  the  INRIA  Holidays  dataset 
[14].  Fig. 5  shows  a  qualitative  comparison  between  paired 
dictionary  learning  and  Weinzaepfel  et  al.  [29].  Notice  that 
HOG  inversion  is  more  blurred  than  key  point  SIFT  since 
HOG  is  histogram  based. 

Dimensionality:  HOG  inversions  are  sensitive  to  the  di¬ 
mensionality  of  their  templates.  For  medium  (10  x  10) 
to  large  templates  (40  x  40),  we  obtain  reasonable  perfor¬ 
mance.  But,  for  small  templates  (5  x  5)  the  inversion  is 
blurred.  Fig. 6  shows  examples  as  the  HOG  descriptor  di¬ 
mensionality  changes. 

In  the  remainder  of  this  section,  we  evaluate  our  algo¬ 
rithms  under  two  benchmarks:  first,  an  inversion  metric  that 
measures  how  well  our  inversions  reconstruct  the  original 
images,  and  second,  a  visualization  challenge  conducted  on 
Amazon  Mechanical  Turk  designed  to  determine  how  well 
people  can  infer  the  original  category  from  the  inverse.  The 
first  experiment  measures  the  algorithm’s  reconstruction  er¬ 
ror,  while  the  second  experiment  analyzes  the  recovery  of 
high  level  semantics. 

4.1.  Inversion  Benchmark 

We consider the inversion performance of our algorithm:
given a HOG feature $y$, how well does our inverse $\phi^{-1}(y)$
reconstruct the original pixels $x$ for each algorithm? Since
HOG is invariant up to a constant shift and scale, we score
each inversion against the original image with normalized
cross correlation. Our results are shown in Tab. 1. Overall,
exemplar LDA does the best at pixel level reconstruction.
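As an illustration of the metric, the sketch below computes a zero-mean normalized cross correlation between two images of the same size. The exact variant used for Tab. 1 is not specified in the text, so this particular formula, the function name, and the `eps` guard against division by zero are assumptions.

```python
import numpy as np

def normalized_cross_correlation(a, b, eps=1e-12):
    """Correlation of two same-sized images after removing each image's mean.

    The score is unchanged by a constant intensity shift or a positive
    rescaling of either image, and equals 1 for a perfect match.
    """
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

Averaging this score over all inverted patches of a category would give one entry of a table like Tab. 1.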

4.2.  Visualization  Benchmark 

While  the  inversion  benchmark  evaluates  how  well  the 
inversions  reconstruct  the  original  image,  it  does  not  cap¬ 
ture  the  high  level  content  of  the  inverse:  is  the  inverse  of  a 
sheep  still  a  sheep?  To  evaluate  this,  we  conducted  a  study 
on  Amazon  Mechanical  Turk.  We  sampled  2,000  windows 


Figure  5:  We  compare  our  paired  dictionary  learning  ap¬ 
proach  on  HOG  with  the  algorithm  of  [29]  on  SIFT.  Since 
HOG  is  invariant  to  color,  we  are  only  able  to  recover  a 
grayscale  image.  Furthermore,  our  blurred  inversion  shows 
that  HOG  is  a  more  coarse  descriptor  than  keypoint  SIFT. 


Figure  6:  Our  inversion  algorithms  are  sensitive  to  the  HOG 
template  size.  Larger  templates  are  easier  to  invert  since 
they  are  less  invariant.  We  show  how  performance  degrades 
as  the  template  becomes  smaller.  Dimensions  in  HOG  space 
shown:  40  x  40,  20  x  20,  10  x  10,  and  5x5. 

corresponding  to  objects  in  PASCAL  VOC  2011.  We  then 
showed  participants  an  inversion  from  one  of  our  algorithms 
and  asked  users  to  classify  it  into  one  of  the  20  categories. 
Each  window  was  shown  to  three  different  users.  Users 
were  required  to  pass  a  training  course  and  qualification 
exam before participating in order to guarantee users understood
the task. Users could optionally select that they were
not  confident  in  their  answer.  We  also  compared  our  al¬ 
gorithms  against  the  standard  black-and-white  HOG  glyph 
popularized  by  [5]. 

Our  results  in  Tab.  2  show  that  paired  dictionary  learn¬ 
ing  and  direct  optimization  provide  the  best  visualization 
of HOG descriptors for humans. Ridge regression and exemplar
LDA perform better than the glyph, but they suffer
from blurred inversions. Human performance on the
HOG  glyph  is  generally  poor,  and  participants  were  even 
the  slowest  at  completing  that  study.  Interestingly,  the  glyph 
does  the  best  job  at  visualizing  bicycles,  likely  due  to  their 
unique  circular  gradients.  Overall,  our  results  suggest  that 
visualizing  HOG  with  the  glyph  is  misleading,  and  using 
richer  diagrams  is  useful  for  interpreting  HOG  vectors. 

There is a strong correlation between the accuracy of humans
classifying the HOG inversions and the performance of
HOG  based  object  detectors.  We  found  human  classifica¬ 
tion  accuracy  on  inversions  and  the  state-of-the-art  object 
detection  AP  scores  from  [9]  are  correlated  with  a  Spear¬ 
man’s  rank  correlation  coefficient  of  0.77.  This  result  sug- 
Category    ELDA    Ridge   Direct  PairDict  Glyph   Expert
bicycle     0.327   0.127   0.362   0.307     0.405   0.438
bird        0.364   0.263   0.378   0.372     0.193   0.059
bottle      0.269   0.282   0.283   0.446     0.312   0.222
car         0.397   0.457   0.617   0.585     0.359   0.389
cat         0.219   0.178   0.381   0.199     0.139   0.286
chair       0.099   0.239   0.223   0.386     0.119   0.167
table       0.152   0.064   0.162   0.237     0.071   0.125
horse       0.260   0.290   0.354   0.446     0.144   0.150
motorbike   0.221   0.232   0.396   0.224     0.298   0.350
person      0.458   0.546   0.502   0.676     0.301   0.375
sofa        0.138   0.100   0.162   0.293     0.104   0.000
Mean        0.282   0.258   0.355   0.383     0.191   0.233

Table 2: We evaluate visualization performance across
twenty PASCAL VOC categories by asking Mechanical
Turk workers to classify our inversions. Numbers are percent
classified correctly; higher is better. Chance is 0.05.
Glyph refers to the standard black-and-white HOG diagram
popularized by [5]. Paired dictionary learning provides the
best visualizations for humans. Expert refers to PhD students
in computer vision performing the same visualization
challenge with HOG glyphs. Notice that even HOG experts
can benefit from paired dictionary learning. Interestingly,
the glyph is best for bicycles.


Figure 7: We show the confusion matrices for each of our
four algorithms as well as the standard HOG black-and-white
glyph visualization. The vertical axis is the ground
truth category and the horizontal axis is the predicted category.
Notice that common confusions are similar to errors
made by detectors. The expert confusion matrix
refers to the workers who are computer vision PhD students.

gests  that  humans  can  predict  the  performance  of  object  de¬ 
tectors  by  only  looking  at  HOG  visualizations. 
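For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation computed on ranks. The sketch below is a minimal NumPy version with average ranks for ties; the function names are ours, and the per-category values needed to reproduce the 0.77 figure are not listed in this section, so no attempt is made to recompute it.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tie = values == v
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    return float(np.corrcoef(ra, rb)[0, 1])
```

Applied to two per-category score lists (for instance, human accuracy on inversions and detector AP), a value near 1 means the categories are ordered almost identically by both measures.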

Fig. 7  shows  the  classification  confusion  matrix  for  all 
algorithms.  Participants  tended  to  make  the  same  mistakes 
that  object  detectors  make.  Notice  that  bottles  are  often  con¬ 
fused  with  people,  motorbikes  with  bicycles,  and  animals 
with  other  animals.  Users  incorrectly  showed  a  strong  prior 
that  the  inversions  were  for  people,  evidenced  by  a  bright 


Figure 8: HOG inversion reveals the world that object detectors
see. The left (a, Human Vision) shows a man standing in a dark
room. If we compute HOG on this image and invert it
(b, HOG Vision), the previously dark scene behind the man
emerges. Notice the wall structure, the lamp post, and the
chair in the bottom right hand corner.

vertical  bar  in  the  confusion  matrix. 

We  also  asked  computer  vision  PhD  students  to  classify 
HOG  glyphs  in  order  to  compare  Mechanical  Turk  workers 
with  experts  in  HOG.  Our  results  are  summarized  in  the  last 
column  of  Tab.  2.  HOG  experts  performed  slightly  better 
than  common  people  on  the  glyph  challenge,  but  experts  on 
glyphs  did  not  beat  common  people  on  other  visualizations. 
This  result  suggests  that  our  algorithms  produce  more  intu¬ 
itive  visualizations  even  for  object  detection  researchers. 

5.  Experiments 

The  underlying  motivation  of  this  paper  has  been  to  de¬ 
velop  feature  inversion  algorithms  and  use  these  visualiza¬ 
tions  to  analyze  object  detectors.  In  this  section,  we  present 
several  experiments  that  use  our  inversions  to  put  on  “HOG 
glasses” to analyze the behavior of HOG.

How  Detectors  See  the  World:  In  our  first  experiment, 
we  attempt  to  reveal  how  object  detectors  see  the  visual 
world.  Fig. 8a  shows  a  normal  photograph  of  a  man,  but 
Fig. 8b  shows  how  HOG  sees  the  same  man.  Since  HOG 
is  invariant  to  illumination  changes,  the  background  of  the 
scene,  invisible  to  the  human  eye,  materializes,  demonstrat¬ 
ing  the  clutter  that  HOG  catches. 

Top  False  Positives:  Seeing  the  world  through  the  eyes 
of  HOG  can  be  helpful  for  understanding  object  detector 
errors.  We  train  a  single  mixture  component  using  SVM 
and  HOG.  Fig. 9  shows  the  top  false  detections  for  a  few 
categories  and  their  inverses.  Notice  that  the  inversions  look 
like  the  positive  class  while  the  original  image  patch  does 
not.  This  experiment  suggests  that  the  false  positives  that 
Figure 9: We trained a single mixture component model
with SVM for a few classes. This figure shows some of
the top false positives and their inversions. Notice that the
inversions look like true positives: the airplane’s wings and
body, and horse’s legs and torso appear in the inversion, but
not necessarily in the original image.


object  detectors  predict  in  HOG  space  are  reasonable  and 
higher  level  reasoning  may  be  necessary  to  improve  object 
recognition  performance. 

Interpolation in Feature Space: Since object detection is
computationally expensive, most state-of-the-art object detectors
today depend on linear classifiers. Fig. 10 analyzes
whether recognition is linearly separable in HOG space by
inverting the midpoint between two positive examples. Not
surprisingly, our results show that frequently the midpoint
no longer resembles the positive class. Since a linear classifier
that labels two examples positive must also label their
midpoint positive, this result indicates that perfect car detection
is not possible with a single linear separator in HOG
space. Car detection may be solvable with view based mixture
components, motivating much recent work in increasing
model complexity [19, 10].

Prototypical Objects: We analyze an object detector’s
prototypical example of an object. Fig. 11 shows the positive
component of the weight vector for a few object detectors
trained with [10]. The prototypes highlight the parts of
objects that each detector finds discriminative. Notice
that the prototypes look similar to the average of the class.

Super Objects: In Fig. 12, we examine how the appearance
of objects changes as we make an object “more positive”
or “more negative.” We move perpendicularly to the
class  decision  boundary  in  HOG  space.  As  the  object  be¬ 
comes  more  and  more  positive,  the  key  gradients  become 
more  pronounced,  but  if  the  object  is  downgraded  towards 
the  negative  world,  the  object  starts  looking  like  noise.  This 
experiment  gives  an  intuitive  visualization  of  what  each  ob- 


Figure 10: We linearly interpolate between examples in
HOG space and invert the path. Columns: Positive #1,
Midpoint, Positive #2. First two rows: occasionally,
the interpolation of two examples is still in the positive
class even under extreme viewpoint change. Last two
rows: frequently, however, the midpoint is no longer in the
positive class. This demonstrates that a single linear separator
in HOG space is insufficient for perfect object detection.


Figure 11: We invert the positive components of a few root
templates  from  the  deformable  parts  model  [10].  Notice  the 
airplane  tail  wing,  the  right  facing  bus,  the  typical  bottle, 
and  a  person  leaning  his  head. 

ject  detector  finds  important. 
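The translation used for super objects can be stated in one line. The sketch below assumes a linear detector with decision function $w \cdot y + b$; the function name and the signed distance parameter `t` are introduced here for illustration. Moving a HOG point by distance `t` along the unit normal changes its score by exactly `t` times the norm of `w`, and the translated point would then be passed to one of the inversion algorithms to obtain an image.

```python
import numpy as np

def move_along_normal(y, w, t):
    """Translate HOG point y by signed distance t orthogonal to the hyperplane w.y + b = 0.

    t > 0 moves towards the positive side ("more positive"),
    t < 0 moves towards the negative side ("more negative").
    """
    return y + t * w / np.linalg.norm(w)
```

Sweeping `t` over a range of negative and positive values and inverting each translated point yields a sequence from the negative world to the positive world.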

6.  Conclusion 

We  have  presented  four  algorithms  for  inverting  and  vi¬ 
sualizing  features  for  object  detection.  While  this  paper  has 
focused  on  HOG,  our  algorithms  are  general  and  can  be  ap¬ 
plied  to  any  feature  descriptor.  We  evaluated  our  method 
against  a  difficult  dataset  with  a  large  human  study  and  we 
presented  several  experiments  that  use  feature  inversion  in 
order  to  see  the  world  through  the  eyes  of  an  object  detector. 
Our  best  performing  algorithm,  paired  dictionary  learning, 
uses  ideas  from  sparse  coding  to  regress  between  feature 
descriptors  and  their  natural  images.  Since  efficient  solvers 
for  sparse  coding  now  exist,  we  are  able  to  invert  features 
at  nearly  interactive  rates.  We  hope  that  others  find  these 
visualizations  useful  in  their  own  research. 
Figure 12: We train single component, linear SVM object
detectors with HOG for a variety of categories and translate
in HOG space orthogonal to the decision hyperplane. The
horizontal axis runs from "To Negative World" (left) to
"To Positive World" (right). Moving towards the right is
making the object more positive and to the left is making it
more negative. The full color image on the right is the
original image. Moving towards the positive world causes
the discriminative gradients of the example to increase, and
moving to the negative world causes the example to become
more like background noise.